import sqlite3
import pandas as pd
Introduction
Structured Query Language (SQL) is a powerful tool for managing and manipulating relational databases. It is essential for data scientists, analysts, and anyone working with large datasets. This chapter will explore the importance of SQL and its applications, and provide example code to illustrate its utility.
Why SQL is Important
- Data Management: SQL allows for efficient management of large volumes of data. It provides the means to create, read, update, and delete data in a relational database.
- Data Retrieval: With SQL, you can perform complex queries to retrieve specific data from one or more tables, making it easier to analyze and draw insights.
- Data Manipulation: SQL enables the manipulation of data through operations such as sorting, filtering, and aggregating. This is crucial for data cleaning and preprocessing.
- Data Integration: SQL supports the integration of data from different sources, allowing for comprehensive data analysis.
- Standardization: SQL is a standardized language used by most relational database management systems (RDBMS), making it a versatile and essential skill for professionals in the field.
Basic SQL Concepts
To illustrate the following SQL concepts, we will use the employees table with the following data:
Table: employees
id | first_name | last_name | department | salary |
---|---|---|---|---|
1 | Alice | Smith | HR | 60000 |
2 | Bob | Johnson | IT | 80000 |
3 | Charlie | Lee | Sales | 55000 |
4 | David | Kim | HR | 75000 |
5 | Eva | Brown | IT | 65000 |
6 | Frank | Wilson | Sales | 70000 |
7 | Grace | Taylor | HR | 62000 |
8 | Henry | Anderson | IT | 77000 |
9 | Irene | Thomas | Sales | 53000 |
10 | Jack | White | HR | 58000 |
11 | Karen | Harris | IT | 69000 |
12 | Leo | Martin | Sales | 50000 |
13 | Mia | Jackson | HR | 64000 |
14 | Noah | Lee | IT | 72000 |
15 | Olivia | Perez | Sales | 68000 |
16 | Paul | Young | HR | 61000 |
17 | Quinn | King | IT | 76000 |
18 | Rachel | Scott | Sales | 57000 |
19 | Sam | Green | HR | 63000 |
20 | Tina | Adams | IT | 81000 |
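Every query in this chapter is executed with pd.read_sql_query(p, con), but the connection object con and the step that loads the employees table into the database are never shown. The following is a minimal setup sketch under those assumptions: it builds an in-memory SQLite database and writes the table above into it; treat the details as illustrative rather than the chapter's actual setup. (A couple of later queries, SELECT * EXCLUDE and POSITION(... IN ...), use dialect-specific syntax that plain SQLite does not accept, so the chapter's connection may in fact point at a different engine such as DuckDB.)
import sqlite3

import pandas as pd

# Assumed setup: an in-memory SQLite database for the examples below
con = sqlite3.connect(":memory:")

# Load the employees table shown above into the database
employees = pd.DataFrame({
    "id": range(1, 21),
    "first_name": ["Alice", "Bob", "Charlie", "David", "Eva", "Frank", "Grace",
                   "Henry", "Irene", "Jack", "Karen", "Leo", "Mia", "Noah",
                   "Olivia", "Paul", "Quinn", "Rachel", "Sam", "Tina"],
    "last_name": ["Smith", "Johnson", "Lee", "Kim", "Brown", "Wilson", "Taylor",
                  "Anderson", "Thomas", "White", "Harris", "Martin", "Jackson",
                  "Lee", "Perez", "Young", "King", "Scott", "Green", "Adams"],
    "department": ["HR", "IT", "Sales"] * 6 + ["HR", "IT"],
    "salary": [60000, 80000, 55000, 75000, 65000, 70000, 62000, 77000, 53000,
               58000, 69000, 50000, 64000, 72000, 68000, 61000, 76000, 57000,
               63000, 81000],
})
employees.to_sql("employees", con, index=False, if_exists="replace")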
SELECT and FROM
The SELECT statement is used to fetch data from a database, and the FROM clause specifies the table.
p = """
-- Selecting all columns from a table
SELECT *
FROM
    employees;
"""
pd.read_sql_query(p, con)
id | first_name | last_name | department | salary |
---|---|---|---|---|
1 | Alice | Smith | HR | 60000 |
2 | Bob | Johnson | IT | 80000 |
3 | Charlie | Lee | Sales | 55000 |
4 | David | Kim | HR | 75000 |
5 | Eva | Brown | IT | 65000 |
6 | Frank | Wilson | Sales | 70000 |
7 | Grace | Taylor | HR | 62000 |
8 | Henry | Anderson | IT | 77000 |
9 | Irene | Thomas | Sales | 53000 |
10 | Jack | White | HR | 58000 |
11 | Karen | Harris | IT | 69000 |
12 | Leo | Martin | Sales | 50000 |
13 | Mia | Jackson | HR | 64000 |
14 | Noah | Lee | IT | 72000 |
15 | Olivia | Perez | Sales | 68000 |
16 | Paul | Young | HR | 61000 |
17 | Quinn | King | IT | 76000 |
18 | Rachel | Scott | Sales | 57000 |
19 | Sam | Green | HR | 63000 |
20 | Tina | Adams | IT | 81000 |
p = """
-- Selecting specific columns
SELECT
    first_name,
    last_name,
    salary
FROM
    employees;
"""
pd.read_sql_query(p, con)
id | first_name | last_name | salary |
---|---|---|---|
1 | Alice | Smith | 60000 |
2 | Bob | Johnson | 80000 |
3 | Charlie | Lee | 55000 |
4 | David | Kim | 75000 |
5 | Eva | Brown | 65000 |
6 | Frank | Wilson | 70000 |
7 | Grace | Taylor | 62000 |
8 | Henry | Anderson | 77000 |
9 | Irene | Thomas | 53000 |
10 | Jack | White | 58000 |
11 | Karen | Harris | 69000 |
12 | Leo | Martin | 50000 |
13 | Mia | Jackson | 64000 |
14 | Noah | Lee | 72000 |
15 | Olivia | Perez | 68000 |
16 | Paul | Young | 61000 |
17 | Quinn | King | 76000 |
18 | Rachel | Scott | 57000 |
19 | Sam | Green | 63000 |
20 | Tina | Adams | 81000 |
SELECT EXCLUDE and RENAME
You can exclude columns with EXCLUDE and rename them with AS for clarity.
p = """
-- Selecting all but one column
SELECT *
EXCLUDE salary
FROM
    employees;
"""
pd.read_sql_query(p, con)
id | first_name | last_name | department |
---|---|---|---|
1 | Alice | Smith | HR |
2 | Bob | Johnson | IT |
3 | Charlie | Lee | Sales |
4 | David | Kim | HR |
5 | Eva | Brown | IT |
6 | Frank | Wilson | Sales |
7 | Grace | Taylor | HR |
8 | Henry | Anderson | IT |
9 | Irene | Thomas | Sales |
10 | Jack | White | HR |
11 | Karen | Harris | IT |
12 | Leo | Martin | Sales |
13 | Mia | Jackson | HR |
14 | Noah | Lee | IT |
15 | Olivia | Perez | Sales |
16 | Paul | Young | HR |
17 | Quinn | King | IT |
18 | Rachel | Scott | Sales |
19 | Sam | Green | HR |
20 | Tina | Adams | IT |
p = """
-- Renaming columns
SELECT
    first_name AS fname,
    last_name AS lname
FROM
    employees;
"""
pd.read_sql_query(p, con)
fname | lname |
---|---|
Alice | Smith |
Bob | Johnson |
Charlie | Lee |
David | Kim |
Eva | Brown |
Frank | Wilson |
Grace | Taylor |
Henry | Anderson |
Irene | Thomas |
Jack | White |
Karen | Harris |
Leo | Martin |
Mia | Jackson |
Noah | Lee |
Olivia | Perez |
Paul | Young |
Quinn | King |
Rachel | Scott |
Sam | Green |
Tina | Adams |
LIMIT and OFFSET
The LIMIT clause restricts the number of rows returned, and OFFSET skips rows before beginning to return rows.
p = """
-- Limiting the number of rows returned
SELECT *
FROM
    employees
LIMIT 10;
"""
pd.read_sql_query(p, con)
id | first_name | last_name | department | salary |
---|---|---|---|---|
1 | Alice | Smith | HR | 60000 |
2 | Bob | Johnson | IT | 80000 |
3 | Charlie | Lee | Sales | 55000 |
4 | David | Kim | HR | 75000 |
5 | Eva | Brown | IT | 65000 |
6 | Frank | Wilson | Sales | 70000 |
7 | Grace | Taylor | HR | 62000 |
8 | Henry | Anderson | IT | 77000 |
9 | Irene | Thomas | Sales | 53000 |
10 | Jack | White | HR | 58000 |
p = """
-- Skipping rows
SELECT *
FROM
    employees
LIMIT 10 OFFSET 5;
"""
pd.read_sql_query(p, con)
id | first_name | last_name | department | salary |
---|---|---|---|---|
6 | Frank | Wilson | Sales | 70000 |
7 | Grace | Taylor | HR | 62000 |
8 | Henry | Anderson | IT | 77000 |
9 | Irene | Thomas | Sales | 53000 |
10 | Jack | White | HR | 58000 |
11 | Karen | Harris | IT | 69000 |
12 | Leo | Martin | Sales | 50000 |
13 | Mia | Jackson | HR | 64000 |
14 | Noah | Lee | IT | 72000 |
15 | Olivia | Perez | Sales | 68000 |
ORDER BY
The ORDER BY clause sorts the result set.
p = """
-- Sorting the result set by salary in ascending order
SELECT *
FROM
    employees
ORDER BY
    salary;
"""
pd.read_sql_query(p, con)
id | first_name | last_name | department | salary |
---|---|---|---|---|
12 | Leo | Martin | Sales | 50000 |
9 | Irene | Thomas | Sales | 53000 |
3 | Charlie | Lee | Sales | 55000 |
18 | Rachel | Scott | Sales | 57000 |
10 | Jack | White | HR | 58000 |
1 | Alice | Smith | HR | 60000 |
16 | Paul | Young | HR | 61000 |
7 | Grace | Taylor | HR | 62000 |
19 | Sam | Green | HR | 63000 |
13 | Mia | Jackson | HR | 64000 |
5 | Eva | Brown | IT | 65000 |
15 | Olivia | Perez | Sales | 68000 |
11 | Karen | Harris | IT | 69000 |
6 | Frank | Wilson | Sales | 70000 |
14 | Noah | Lee | IT | 72000 |
4 | David | Kim | HR | 75000 |
17 | Quinn | King | IT | 76000 |
8 | Henry | Anderson | IT | 77000 |
2 | Bob | Johnson | IT | 80000 |
20 | Tina | Adams | IT | 81000 |
p = """
-- Sorting in descending order
SELECT *
FROM
    employees
ORDER BY
    salary DESC;
"""
pd.read_sql_query(p, con)
id | first_name | last_name | department | salary |
---|---|---|---|---|
20 | Tina | Adams | IT | 81000 |
2 | Bob | Johnson | IT | 80000 |
8 | Henry | Anderson | IT | 77000 |
17 | Quinn | King | IT | 76000 |
4 | David | Kim | HR | 75000 |
14 | Noah | Lee | IT | 72000 |
6 | Frank | Wilson | Sales | 70000 |
11 | Karen | Harris | IT | 69000 |
15 | Olivia | Perez | Sales | 68000 |
5 | Eva | Brown | IT | 65000 |
13 | Mia | Jackson | HR | 64000 |
19 | Sam | Green | HR | 63000 |
7 | Grace | Taylor | HR | 62000 |
16 | Paul | Young | HR | 61000 |
1 | Alice | Smith | HR | 60000 |
10 | Jack | White | HR | 58000 |
18 | Rachel | Scott | Sales | 57000 |
3 | Charlie | Lee | Sales | 55000 |
9 | Irene | Thomas | Sales | 53000 |
12 | Leo | Martin | Sales | 50000 |
AND, OR, NOT
Logical operators filter records based on multiple conditions.
p = """
-- Using AND, OR, NOT operators
SELECT *
FROM
    employees
WHERE
    department = 'Sales' AND salary > 50000;
"""
pd.read_sql_query(p, con)
id | first_name | last_name | department | salary |
---|---|---|---|---|
6 | Frank | Wilson | Sales | 70000 |
15 | Olivia | Perez | Sales | 68000 |
18 | Rachel | Scott | Sales | 57000 |
Numeric Operations
Perform arithmetic operations in SQL.
p = """
-- Calculating a new column
SELECT
    first_name,
    last_name,
    salary,
    salary * 1.1 AS new_salary
FROM
    employees;
"""
pd.read_sql_query(p, con)
first_name | last_name | salary | new_salary |
---|---|---|---|
Alice | Smith | 60000 | 66000.0 |
Bob | Johnson | 80000 | 88000.0 |
Charlie | Lee | 55000 | 60500.0 |
David | Kim | 75000 | 82500.0 |
Eva | Brown | 65000 | 71500.0 |
Frank | Wilson | 70000 | 77000.0 |
Grace | Taylor | 62000 | 68200.0 |
Henry | Anderson | 77000 | 84700.0 |
Irene | Thomas | 53000 | 58300.0 |
Jack | White | 58000 | 63800.0 |
Karen | Harris | 69000 | 75900.0 |
Leo | Martin | 50000 | 55000.0 |
Mia | Jackson | 64000 | 70400.0 |
Noah | Lee | 72000 | 79200.0 |
Olivia | Perez | 68000 | 74800.0 |
Paul | Young | 61000 | 67100.0 |
Quinn | King | 76000 | 83600.0 |
Rachel | Scott | 57000 | 62700.0 |
Sam | Green | 63000 | 69300.0 |
Tina | Adams | 81000 | 89100.0 |
LIKE and NOT LIKE
Pattern matching using LIKE.
p = """
-- Pattern matching
SELECT *
FROM
    employees
WHERE
    last_name LIKE 'S%';
"""
pd.read_sql_query(p, con)
id | first_name | last_name | department | salary |
---|---|---|---|---|
1 | Alice | Smith | HR | 60000 |
18 | Rachel | Scott | Sales | 57000 |
BETWEEN
Range filtering using BETWEEN.
p = """
-- Filtering within a range
SELECT *
FROM
    employees
WHERE
    salary BETWEEN 40000 AND 60000;
"""
pd.read_sql_query(p, con)
id | first_name | last_name | department | salary |
---|---|---|---|---|
12 | Leo | Martin | Sales | 50000 |
3 | Charlie | Lee | Sales | 55000 |
9 | Irene | Thomas | Sales | 53000 |
10 | Jack | White | HR | 58000 |
1 | Alice | Smith | HR | 60000 |
OFFSET
Skip a specific number of rows before starting to return rows.
p = """
-- Skipping the first 5 rows
SELECT *
FROM
    employees
OFFSET 5;
"""
pd.read_sql_query(p, con)
id | first_name | last_name | department | salary |
---|---|---|---|---|
6 | Frank | Wilson | Sales | 70000 |
7 | Grace | Taylor | HR | 62000 |
8 | Henry | Anderson | IT | 77000 |
9 | Irene | Thomas | Sales | 53000 |
10 | Jack | White | HR | 58000 |
11 | Karen | Harris | IT | 69000 |
12 | Leo | Martin | Sales | 50000 |
13 | Mia | Jackson | HR | 64000 |
14 | Noah | Lee | IT | 72000 |
15 | Olivia | Perez | Sales | 68000 |
16 | Paul | Young | HR | 61000 |
17 | Quinn | King | IT | 76000 |
18 | Rachel | Scott | Sales | 57000 |
19 | Sam | Green | HR | 63000 |
20 | Tina | Adams | IT | 81000 |
Intermediate SQL Concepts
Joins
Table: employees
id | first_name | last_name | department_id | salary |
---|---|---|---|---|
1 | Alice | Smith | 1 | 60000 |
2 | Bob | Johnson | 2 | 80000 |
3 | Charlie | Lee | 3 | 55000 |
4 | David | Kim | 1 | 75000 |
5 | Eva | Brown | 2 | 65000 |
6 | Frank | Wilson | 3 | 70000 |
7 | Grace | Taylor | 1 | 62000 |
8 | Henry | Anderson | 2 | 77000 |
9 | Irene | Thomas | 3 | 53000 |
10 | Jack | White | 1 | 58000 |
11 | Karen | Harris | 2 | 69000 |
12 | Leo | Martin | 3 | 50000 |
13 | Mia | Jackson | 1 | 64000 |
14 | Noah | Lee | 2 | 72000 |
15 | Olivia | Perez | 3 | 68000 |
16 | Paul | Young | 1 | 61000 |
17 | Quinn | King | 2 | 76000 |
18 | Rachel | Scott | 3 | 57000 |
19 | Sam | Green | 1 | 63000 |
20 | Tina | Adams | 2 | 81000 |
Table: departments
department_id | department_name |
---|---|
1 | HR |
2 | IT |
3 | Sales |
Combine rows from two or more tables based on a related column.
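For the join below to run against con, both tables have to exist in the database. Continuing the assumed setup sketched earlier (again, an assumption rather than the chapter's own code), you could register the departments table and rebuild employees with a department_id column like this:
# Assumed setup: register the departments lookup table with `con`
departments = pd.DataFrame({
    "department_id": [1, 2, 3],
    "department_name": ["HR", "IT", "Sales"],
})
departments.to_sql("departments", con, index=False, if_exists="replace")

# Rebuild employees with a department_id column, matching the table above
employees_v2 = pd.read_sql_query("SELECT * FROM employees;", con)
employees_v2["department_id"] = employees_v2["department"].map({"HR": 1, "IT": 2, "Sales": 3})
employees_v2 = employees_v2[["id", "first_name", "last_name", "department_id", "salary"]]
employees_v2.to_sql("employees", con, index=False, if_exists="replace")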
p = """
-- Inner join example
SELECT
    employees.first_name,
    employees.last_name,
    departments.department_name
FROM
    employees
INNER JOIN
    departments ON employees.department_id = departments.department_id;
"""
pd.read_sql_query(p, con)
first_name | last_name | department_name |
---|---|---|
Alice | Smith | HR |
David | Kim | HR |
Grace | Taylor | HR |
Jack | White | HR |
Mia | Jackson | HR |
Paul | Young | HR |
Sam | Green | HR |
Bob | Johnson | IT |
Eva | Brown | IT |
Henry | Anderson | IT |
Karen | Harris | IT |
Noah | Lee | IT |
Quinn | King | IT |
Tina | Adams | IT |
Charlie | Lee | Sales |
Frank | Wilson | Sales |
Irene | Thomas | Sales |
Leo | Martin | Sales |
Olivia | Perez | Sales |
Rachel | Scott | Sales |
CAST
Convert data from one type to another.
p = """
-- Casting a column
SELECT
    CAST(salary AS DECIMAL(10, 2))
FROM
    employees;
"""
pd.read_sql_query(p, con)
salary |
---|
60000.00 |
80000.00 |
55000.00 |
75000.00 |
65000.00 |
70000.00 |
62000.00 |
77000.00 |
53000.00 |
58000.00 |
69000.00 |
50000.00 |
64000.00 |
72000.00 |
68000.00 |
61000.00 |
76000.00 |
57000.00 |
63000.00 |
81000.00 |
Aggregations
Perform calculations on a set of values.
p = """
-- Using aggregation functions
SELECT
    department_id,
    COUNT(id) AS num_employees
FROM
    employees
GROUP BY
    department_id;
"""
pd.read_sql_query(p, con)
department_id | num_employees |
---|---|
1 | 7 |
2 | 7 |
3 | 6 |
GROUP BY and HAVING
Group rows that have the same values and filter the resulting groups.
p = """
-- Grouping rows and filtering groups
SELECT
    department_id,
    COUNT(id) AS num_employees
FROM
    employees
GROUP BY
    department_id
HAVING
    COUNT(id) > 5;
"""
pd.read_sql_query(p, con)
department_id | num_employees |
---|---|
1 | 7 |
2 | 7 |
3 | 6 |
UNION, INTERSECT, MINUS
Combine the result sets of two or more queries.
Table: managers
manager_id | first_name | last_name |
---|---|---|
1 | Michael | Brown |
2 | Sarah | Johnson |
3 | John | Lee |
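As with the other tables, the managers table has to be present in the database for the UNION query below to run; here is a minimal loading step, assumed rather than shown in the original:
# Assumed setup: load the managers table shown above into `con`
managers = pd.DataFrame({
    "manager_id": [1, 2, 3],
    "first_name": ["Michael", "Sarah", "John"],
    "last_name": ["Brown", "Johnson", "Lee"],
})
managers.to_sql("managers", con, index=False, if_exists="replace")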
p = """
-- Union example
SELECT
    first_name
FROM
    employees
UNION
SELECT
    first_name
FROM
    managers;
"""
pd.read_sql_query(p, con)
first_name |
---|
Alice |
Bob |
Charlie |
David |
Eva |
Frank |
Grace |
Henry |
Irene |
Jack |
Karen |
Leo |
Mia |
Noah |
Olivia |
Paul |
Quinn |
Rachel |
Sam |
Tina |
Michael |
Sarah |
John |
POSITION
Find the position of a substring.
p = """
-- Finding substring position
SELECT
    POSITION('e' IN first_name)
FROM
    employees;
"""
pd.read_sql_query(p, con)
position |
---|
0 |
0 |
4 |
0 |
0 |
2 |
0 |
3 |
2 |
0 |
0 |
3 |
0 |
0 |
0 |
0 |
0 |
2 |
0 |
0 |
CASE
Conditional logic in SQL.
p = """
-- Using CASE statements
SELECT
    first_name,
    last_name,
    CASE
        WHEN salary > 60000 THEN 'High'
        WHEN salary BETWEEN 40000 AND 60000 THEN 'Medium'
        ELSE 'Low'
    END AS salary_category
FROM
    employees;
"""
pd.read_sql_query(p, con)
first_name | last_name | salary_category |
---|---|---|
Alice | Smith | Medium |
Bob | Johnson | High |
Charlie | Lee | Medium |
David | Kim | High |
Eva | Brown | Medium |
Frank | Wilson | High |
Grace | Taylor | Medium |
Henry | Anderson | High |
Irene | Thomas | Medium |
Jack | White | Medium |
Karen | Harris | High |
Leo | Martin | Medium |
Mia | Jackson | Medium |
Noah | Lee | High |
Olivia | Perez | High |
Paul | Young | Medium |
Quinn | King | High |
Rachel | Scott | Medium |
Sam | Green | Medium |
Tina | Adams | High |
Joins
Introduction
It’s rare that a data analysis involves only a single data frame. Typically you have many data frames, and you must join them together to answer the questions that you’re interested in.
pandas has a really rich set of options for combining one or more data frames, with the two most important being concatenate and merge. Some of the examples in this chapter show you how to join a pair of data frames. Fortunately this is enough, since you can combine three data frames by combining two pairs.
# remove cell
import matplotlib_inline.backend_inline
import matplotlib.pyplot as plt

# Plot settings
plt.style.use("https://github.com/aeturrell/python4DS/raw/main/plot_style.txt")
matplotlib_inline.backend_inline.set_matplotlib_formats("svg")
Prerequisites
This chapter will use the pandas data analysis package.
Concatenate
If you have two or more data frames with the same index or the same columns, you can glue them together into a single data frame using pd.concat().
For the same columns, pass axis=0 to glue the index together; for the same index, pass axis=1 to glue the columns together. The concatenate function will typically be used on a list of data frames.
If you want to track where the original data came from in the final data frame, use the keys keyword.
Here’s an example using data on two different states’ populations that also makes use of the keys option:
import pandas as pd
import urllib.request
= "http://www.stata-press.com/data/r14/"
base_url = ["ca", "il"]
state_codes = "pop.dta"
end_url = {'User-Agent': 'Mozilla/5.0'}
headers
def fetch_data(url):
= urllib.request.Request(url, headers=headers)
req with urllib.request.urlopen(req) as response:
return pd.read_stata(response)
# This grabs the two data frames, one for each state
= [fetch_data(base_url + state + end_url) for state in state_codes]
list_of_state_dfs
# Show example of first entry in list of data frames
print(list_of_state_dfs[0])
county pop
0 Los Angeles 9878554
1 Orange 2997033
2 Ventura 798364
# Concatenate the list of data frames
df = pd.concat(list_of_state_dfs, keys=state_codes, axis=0)
Note that the keys argument is optional, but is useful for keeping track of origin data frames within the merged data frame.
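To make the axis=1 case mentioned above concrete, here is a small sketch with made-up data frames that share an index; passing axis=1 lines their columns up side by side:
# Two made-up data frames with the same index
prices = pd.DataFrame({"price": [1.5, 2.0]}, index=["apples", "pears"])
stock = pd.DataFrame({"stock": [10, 4]}, index=["apples", "pears"])

# axis=1 glues the columns together, aligning rows on the shared index
print(pd.concat([prices, stock], axis=1))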
Exercise
Concatenate the following two data frames:
df1 = pd.DataFrame([['a', 1], ['b', 2]],
                   columns=['letter', 'number'])

df2 = pd.DataFrame([['c', 3], ['d', 4]],
                   columns=['letter', 'number'])
Merge
There are so many options for merging data frames using pd.merge(left, right, on=..., how=...) that we won’t be able to cover them all here. The most important features are: the two data frames to be merged, what variables (aka keys) to merge on (and these can be indexes) via on=, and how to do the merge (e.g. left, right, outer, inner) via how=. This diagram shows an example of a merge using keys from the left-hand data frame:
The how= keyword works in the following ways:
- how='left' uses keys from the left data frame only to merge.
- how='right' uses keys from the right data frame only to merge.
- how='inner' uses keys that appear in both data frames to merge.
- how='outer' uses the cartesian product of keys in both data frames to merge on.
Let’s see examples of some of these:
left = pd.DataFrame(
    {
        "key1": ["K0", "K0", "K1", "K2"],
        "key2": ["K0", "K1", "K0", "K1"],
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
    }
)
right = pd.DataFrame(
    {
        "key1": ["K0", "K1", "K1", "K2"],
        "key2": ["K0", "K0", "K0", "K0"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    }
)
# Right merge
pd.merge(left, right, on=["key1", "key2"], how="right")
key1 | key2 | A | B | C | D | |
---|---|---|---|---|---|---|
0 | K0 | K0 | A0 | B0 | C0 | D0 |
1 | K1 | K0 | A2 | B2 | C1 | D1 |
2 | K1 | K0 | A2 | B2 | C2 | D2 |
3 | K2 | K0 | NaN | NaN | C3 | D3 |
Note that the key combination of K2 and K0 did not exist in the left-hand data frame, and so its entries in the final data frame are NaNs. But it does have entries because we chose the keys from the right-hand data frame.
What about an inner merge?
=["key1", "key2"], how="inner") pd.merge(left, right, on
key1 | key2 | A | B | C | D | |
---|---|---|---|---|---|---|
0 | K0 | K0 | A0 | B0 | C0 | D0 |
1 | K1 | K0 | A2 | B2 | C1 | D1 |
2 | K1 | K0 | A2 | B2 | C2 | D2 |
Now we see that the combination of K2 and K0 is excluded because it didn’t exist in the overlap of keys in both data frames.
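For completeness, here is a sketch of the how='left' case with the same left and right data frames: every key combination from left is kept, and rows with no match in right get NaN in columns C and D.
# Left merge: keep all key combinations from `left`
pd.merge(left, right, on=["key1", "key2"], how="left")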
Finally, let’s take a look at an outer merge that comes with some extra info via the indicator keyword:
pd.merge(left, right, on=["key1", "key2"], how="outer", indicator=True)
key1 | key2 | A | B | C | D | _merge | |
---|---|---|---|---|---|---|---|
0 | K0 | K0 | A0 | B0 | C0 | D0 | both |
1 | K0 | K1 | A1 | B1 | NaN | NaN | left_only |
2 | K1 | K0 | A2 | B2 | C1 | D1 | both |
3 | K1 | K0 | A2 | B2 | C2 | D2 | both |
4 | K2 | K0 | NaN | NaN | C3 | D3 | right_only |
5 | K2 | K1 | A3 | B3 | NaN | NaN | left_only |
Now we can see that the products of all key combinations are here. The indicator=True option has caused an extra column to be added, called ’_merge’, that tells us which data frame the keys on that row came from.
Exercise
Merge the following two data frames using the left_on and right_on keyword arguments to specify a join on lkey and rkey respectively:
df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],
                    'value': [1, 2, 3, 5]})
df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'],
                    'value': [5, 6, 7, 8]})
Exercise
Merge the following two data frames on "a" using how="left" as a keyword argument:
df1 = pd.DataFrame({'a': ['foo', 'bar'], 'b': [1, 2]})
df2 = pd.DataFrame({'a': ['foo', 'baz'], 'c': [3, 4]})
What do you notice about the position .loc[1, "c"] in the merged data frame?
For more on the options for merging, see pandas’ comprehensive merging documentation.
Introduction to SQL Joins
SQL joins are used to combine rows from two or more tables based on a related column between them. Understanding joins is crucial for effective data retrieval and manipulation. This section will cover the four main types of joins: INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL JOIN, using visual aids and code snippets with a fake data frame.
Types of Joins
INNER JOIN
An INNER JOIN returns only the rows that have matching values in both tables.
p = """
-- Inner join example
SELECT
    A.id,
    A.name,
    B.order_id
FROM
    Customers A
INNER JOIN
    Orders B ON A.id = B.customer_id;
"""
pd.read_sql_query(p, con)
LEFT JOIN
A LEFT JOIN returns all the rows from the left table and the matched rows from the right table. If no match is found, NULL values are returned for columns from the right table.
p = """
-- Left join example
SELECT
    A.id,
    A.name,
    B.order_id
FROM
    Customers A
LEFT JOIN
    Orders B ON A.id = B.customer_id;
"""
pd.read_sql_query(p, con)
RIGHT JOIN
A RIGHT JOIN returns all the rows from the right table and the matched rows from the left table. If no match is found, NULL values are returned for columns from the left table.
= """
p SELECT
id,
A.
A.name,
B.order_idFROM
Customers ARIGHT JOIN
ON A.id = B.customer_id;
Orders B """
pd.read_sql_query(p, con)
FULL JOIN
A FULL JOIN returns all the rows when there is a match in either left or right table. Rows without a match in one of the tables will contain NULL values for columns from that table.
p = """
-- Full join example
SELECT
    A.id,
    A.name,
    B.order_id
FROM
    Customers A
FULL JOIN
    Orders B ON A.id = B.customer_id;
"""
pd.read_sql_query(p, con)
Example Data Frames
To illustrate these joins, let’s consider two fake data frames: Customers and Orders.
import pandas as pd
# Creating a fake data frame for Customers
customers = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'David']
})

# Creating a fake data frame for Orders
orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104],
    'customer_id': [1, 2, 2, 4]
})
# Display the data frames
print("Customers Data Frame")
print(customers)
Customers Data Frame
id name
0 1 Alice
1 2 Bob
2 3 Charlie
3 4 David
print("Orders Data Frame")
print(orders)
Orders Data Frame
order_id customer_id
0 101 1
1 102 2
2 103 2
3 104 4
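The join queries earlier in this section were written against Customers and Orders tables in the database, but only the pandas data frames are shown. As a sketch of how the examples could be made runnable (an assumption, not the chapter's own setup), you could register the data frames with the connection and re-run the inner join; note that SQLite supports RIGHT and FULL JOIN only from version 3.39 onwards, so those two examples may need a newer engine.
# Assumed setup: write the data frames to the connection used throughout
customers.to_sql("Customers", con, index=False, if_exists="replace")
orders.to_sql("Orders", con, index=False, if_exists="replace")

# Re-run the inner join example against the registered tables
p = """
SELECT
    A.id,
    A.name,
    B.order_id
FROM
    Customers A
INNER JOIN
    Orders B ON A.id = B.customer_id;
"""
print(pd.read_sql_query(p, con))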
Conclusion
SQL is an indispensable tool for anyone working with data. Its ability to manage, manipulate, and integrate data makes it essential for data analysis and decision-making. The examples provided in this chapter illustrate some of the key operations and techniques used in SQL, highlighting its importance in the field of data science.